An introduction to the tidyverse
tidyr, dplyr and purr
An introduction to the tidyverse
The contents of this document illustrate the basic use and advantages of the tidyverse suite of packages for data organization and management. The contents and examples were adapted from Ameztegui and Granda (2019).
Are you familiar with R and the tidyverse?
The tidyverse
The tidyverse is a collection of R packages designed for data science, as a suite aimed at easening the data analysis in all its steps.
Created by Hadley Wickham, chief scientist of RStudio, and author of more than 30 R packages (readr, ggplot2, plyr, devtools, roxygen2, rmarkdown…). All packages share an underlying design philosophy, grammar, and data structures.
So what’s exactly in the tidyverse?
ggplot2a system for creating graphics, based on the Grammar of Graphicsreadra fast and friendly way to read rectangular data (csv, txt…)tibblea tibble is a re-imagining version of the data frame, keeping what time has proven to be effective and throwing out what has notstringrprovides a cohesive set of functions designed to make working with strings as easy as possibleforcatsprovides a suite of useful tools that solve common problems with factorsdplyrprovides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challengestidyrprovides a set of functions that help you get to tidy datapurrrenhances functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors
Installation and use
Install all the packages in the tidyverse by running
install.packages("tidyverse")- Run
library(tidyverse)to load the core tidyverse and make it available in your current R session. - Learn more about the tidyverse package at http://tidyverse.tidyverse.org.
Or check the cheatsheets
Before we start…
Neither
tidyr, nordplyrorpurrrdo anything that can’t be done with base R code,applyfamily functions,forloops or other packages.Designed to be more efficient (in time), easier to read and easier to use. More intuitive to use, specially for beginners (it may require some adaptation if you are used to base R code).
Valid mostly for data.frames. For other formats (matrices, arrays)
plyrcan be used.
Our data
plots [11858 x 15]: all plots from the Third Spanish Forest Inventory (IFN3) in Cataloniatrees [111756 x 12]: all trees with dbh > 7.5 cm measured in both IFN2 and IFN3species [14778 x 15]: number of trees per hectare in each plot, by species and size classcoordinates [11858 x 6]: X and Y UTM coordinates of each plot.leaf [10447 x 3]: leaf biomass and carbon content for those IFN3 plots where they were available
A brief introduction of data managment with dplyr
Let’s have a look at the data
Data inspection must be the first thing we do before starting any kind of analysis.
trees## # A tibble: 111,756 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 08 022 7 8.3 31.8 20 20.3 18.9 9
## 2 080002 08 476 38 9.1 31.8 35 34 32.4 9
## 3 080003 08 021 25 7 31.8 25 24.8 17.6 11
## 4 080004 08 021 28 8.89 31.8 15 16.8 12.6 9.5
## 5 080006 08 021 19 11.2 14.1 35 34.0 30.9 13
## 6 080007 08 021 32 12 14.1 35 33.1 28.2 10
## 7 080008 08 243 40 7.8 31.8 15 15 13.2 6
## 8 080009 08 045 16 5.09 31.8 20 17.5 15.3 7
## 9 080010 08 243 47 26.9 5.09 65 67.4 66.8 16.5
## 10 080013 08 022 44 2.7 127. 15 15.1 12.6 9.5
## # ... with 111,746 more rows
plots## # A tibble: 11,858 x 15
## Codi Provincia Cla Subclase FccTot FccArb FechaIni HoraIni
## <fct> <chr> <fct> <fct> <int> <int> <date> <dttm>
## 1 0800~ 08 A 1 80 70 2001-07-09 2017-11-26 09:44:00
## 2 0800~ 08 A 1 80 70 2001-08-06 2017-11-26 09:18:58
## 3 0800~ 08 A 1 90 80 2001-08-06 2017-11-26 12:08:09
## 4 0800~ 08 A 1 90 50 2001-07-09 2017-11-26 13:23:23
## 5 0800~ 08 A 1 70 60 2001-08-03 2017-11-26 09:11:28
## 6 0800~ 08 A 1 90 90 2001-08-01 2017-11-26 13:00:33
## 7 0800~ 08 A 1 90 90 2001-08-07 2017-11-26 10:08:15
## 8 0800~ 08 A 1 70 60 2001-08-03 2017-11-26 12:12:03
## 9 0800~ 08 A 1 80 70 2001-08-02 2017-11-26 09:00:16
## 10 0800~ 08 A 1 80 80 2001-06-14 2017-11-26 12:34:21
## # ... with 11,848 more rows, and 7 more variables: FechaFin <date>, HoraFin <dttm>,
## # Rocosid <int>, Textura <int>, MatOrg <int>, PhSuelo <int>, FechaPh <date>
species## Warning: Detecting old grouped_df format, replacing `vars` attribute by `groups`
## # A tibble: 14,778 x 15
## # Groups: Codi, Especie [14,778]
## Codi Especie CD_10 CD_15 CD_20 CD_25 CD_30 CD_35 CD_40 CD_45 CD_50 CD_55 CD_60
## * <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 022 0 159. 31.8 111. 60.1 19.2 5.09 0 0 0 0
## 2 080002 021 0 0 0 0 0 74.2 28.3 63.7 0 0 0
## 3 080002 022 0 0 0 173. 31.8 0 0 0 0 0 0
## 4 080002 476 0 0 0 0 0 31.8 0 0 0 0 0
## 5 080003 021 0 0 0 31.8 0 0 0 0 5.09 0 0
## 6 080003 022 0 127. 0 0 46.0 127. 0 14.1 0 0 0
## 7 080004 021 0 31.8 0 0 31.8 0 0 0 0 0 0
## 8 080005 071 0 0 0 14.1 5.09 31.8 0 0 0 0 31.8
## 9 080005 243 0 0 14.1 0 14.1 0 5.09 0 0 0 0
## 10 080005 278 0 286. 31.8 0 0 31.8 0 0 0 0 0
## # ... with 14,768 more rows, and 2 more variables: CD_65 <dbl>, CD_70 <dbl>
class(trees)## [1] "tbl_df" "tbl" "data.frame"
Tibbles, are not usual data.frames:
class
tbl_dfprint only 10 rows by default
informs about variable types
Besides that, it is like a data.frame (it is a data.frame)
glimpse(trees)## Observations: 111,756
## Variables: 10
## $ Codi <fct> 080001, 080002, 080003, 080004, 080006, 080007, 080008, 080009...
## $ Provincia <chr> "08", "08", "08", "08", "08", "08", "08", "08", "08", "08", "0...
## $ Especie <fct> 022, 476, 021, 021, 021, 021, 243, 045, 243, 022, 021, 021, 07...
## $ Rumbo <dbl> 7, 38, 25, 28, 19, 32, 40, 16, 47, 44, 13, 9, 9, 25, 199, 6, 4...
## $ Dist <dbl> 8.30, 9.10, 7.00, 8.89, 11.19, 12.00, 7.80, 5.09, 26.89, 2.70,...
## $ N <dbl> 31.83, 31.83, 31.83, 31.83, 14.14, 14.14, 31.83, 31.83, 5.09, ...
## $ CD <dbl> 20, 35, 25, 15, 35, 35, 15, 20, 65, 15, 20, 30, 45, 20, 35, 30...
## $ DiamIf3 <dbl> 20.30, 34.00, 24.80, 16.85, 34.05, 33.10, 15.00, 17.50, 67.40,...
## $ DiamIf2 <dbl> 18.90, 32.45, 17.55, 12.65, 30.90, 28.15, 13.25, 15.30, 66.80,...
## $ HeiIf3 <dbl> 9.00, 9.00, 11.00, 9.50, 13.00, 10.00, 6.00, 7.00, 16.50, 9.50...
dplyr
dplyr is a grammar for data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
The 5 main verbs of dplyr
filter: keep/select the rows that match a given conditionselect: keep columns by namearrange: sort rows acording to a given variablemutate: transform existent variables or create new onessummarise: do some summary statistics and reduce data
Common structure
Most of the tidyverse verbs/functions share certain elements in common:
verb(data, ...)- First argument: data (as data.frame or
tbl_df) - The rest of arguments specify what to do with the data frame
- Output is always another data frame (tbl_df or data.frame)
- Unless we are assigning (
<-), functions never modifies the original data
Selecting rows (filter)
Use filter() to choose rows/cases/observations where a given condition(s) is true.
Select trees taller than 6 meters:
filter(trees, HeiIf3 > 6)## # A tibble: 100,247 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 08 022 7 8.3 31.8 20 20.3 18.9 9
## 2 080002 08 476 38 9.1 31.8 35 34 32.4 9
## 3 080003 08 021 25 7 31.8 25 24.8 17.6 11
## 4 080004 08 021 28 8.89 31.8 15 16.8 12.6 9.5
## 5 080006 08 021 19 11.2 14.1 35 34.0 30.9 13
## 6 080007 08 021 32 12 14.1 35 33.1 28.2 10
## 7 080009 08 045 16 5.09 31.8 20 17.5 15.3 7
## 8 080010 08 243 47 26.9 5.09 65 67.4 66.8 16.5
## 9 080013 08 022 44 2.7 127. 15 15.1 12.6 9.5
## 10 080014 08 021 13 8.89 31.8 20 19.2 14.2 11.5
## # ... with 100,237 more rows
Select trees in province 50, Zaragoza:
filter(trees, Provincia == '50')## # A tibble: 0 x 10
## # ... with 10 variables: Codi <fct>, Provincia <chr>, Especie <fct>, Rumbo <dbl>,
## # Dist <dbl>, N <dbl>, CD <dbl>, DiamIf3 <dbl>, DiamIf2 <dbl>, HeiIf3 <dbl>
filter(trees, CD %in% c(45, 70))## # A tibble: 2,552 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080016 08 071 9 26.9 5.09 45 46.0 45 19
## 2 080113 08 021 19 11.1 14.1 70 72.6 70.7 20.9
## 3 080686 08 042 43 7.09 31.8 45 44.3 42.5 12.5
## 4 080721 08 042 16 15.8 5.09 45 46.2 39.8 17.5
## 5 080743 08 042 3 9.69 31.8 45 46.1 40.7 25.5
## 6 081271 08 024 93 10.8 14.1 45 44.0 39.0 17.3
## 7 081278 08 024 87 10.1 14.1 45 44.6 38.9 15.1
## 8 081354 08 026 16 8.6 31.8 45 43.4 30.8 16.9
## 9 081402 08 054 51 5.19 31.8 70 102. 95.5 17.7
## 10 081943 08 024 1 10.3 14.1 45 43.2 33.8 11.7
## # ... with 2,542 more rows
Exercise 1
Let's find those plots in IFN3n (`plots` data frame) that:
1.1 Are located either in Barcelona (08) or Girona (17)
1.2 Were measured **in** January 2001
1.3 It took **more** than 2 hours to measure (7200s)
Selecting columns (select)
Select() it’s the function in dplyr to choose variables from a tbl. It keeps only the variables you mention.
trees## # A tibble: 111,756 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 08 022 7 8.3 31.8 20 20.3 18.9 9
## 2 080002 08 476 38 9.1 31.8 35 34 32.4 9
## 3 080003 08 021 25 7 31.8 25 24.8 17.6 11
## 4 080004 08 021 28 8.89 31.8 15 16.8 12.6 9.5
## 5 080006 08 021 19 11.2 14.1 35 34.0 30.9 13
## 6 080007 08 021 32 12 14.1 35 33.1 28.2 10
## 7 080008 08 243 40 7.8 31.8 15 15 13.2 6
## 8 080009 08 045 16 5.09 31.8 20 17.5 15.3 7
## 9 080010 08 243 47 26.9 5.09 65 67.4 66.8 16.5
## 10 080013 08 022 44 2.7 127. 15 15.1 12.6 9.5
## # ... with 111,746 more rows
Selecting a single column/variable:
select(trees, DiamIf3)## # A tibble: 111,756 x 1
## DiamIf3
## <dbl>
## 1 20.3
## 2 34
## 3 24.8
## 4 16.8
## 5 34.0
## 6 33.1
## 7 15
## 8 17.5
## 9 67.4
## 10 15.1
## # ... with 111,746 more rows
Selecting all columns but the targeted one:
select(trees, -Codi)## # A tibble: 111,756 x 9
## Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 08 022 7 8.3 31.8 20 20.3 18.9 9
## 2 08 476 38 9.1 31.8 35 34 32.4 9
## 3 08 021 25 7 31.8 25 24.8 17.6 11
## 4 08 021 28 8.89 31.8 15 16.8 12.6 9.5
## 5 08 021 19 11.2 14.1 35 34.0 30.9 13
## 6 08 021 32 12 14.1 35 33.1 28.2 10
## 7 08 243 40 7.8 31.8 15 15 13.2 6
## 8 08 045 16 5.09 31.8 20 17.5 15.3 7
## 9 08 243 47 26.9 5.09 65 67.4 66.8 16.5
## 10 08 022 44 2.7 127. 15 15.1 12.6 9.5
## # ... with 111,746 more rows
Selecting several columns:
select(trees, DiamIf2, DiamIf3)## # A tibble: 111,756 x 2
## DiamIf2 DiamIf3
## <dbl> <dbl>
## 1 18.9 20.3
## 2 32.4 34
## 3 17.6 24.8
## 4 12.6 16.8
## 5 30.9 34.0
## 6 28.2 33.1
## 7 13.2 15
## 8 15.3 17.5
## 9 66.8 67.4
## 10 12.6 15.1
## # ... with 111,746 more rows
Selecting a range of columns:
select(trees, Codi:Dist)## # A tibble: 111,756 x 5
## Codi Provincia Especie Rumbo Dist
## <fct> <chr> <fct> <dbl> <dbl>
## 1 080001 08 022 7 8.3
## 2 080002 08 476 38 9.1
## 3 080003 08 021 25 7
## 4 080004 08 021 28 8.89
## 5 080006 08 021 19 11.2
## 6 080007 08 021 32 12
## 7 080008 08 243 40 7.8
## 8 080009 08 045 16 5.09
## 9 080010 08 243 47 26.9
## 10 080013 08 022 44 2.7
## # ... with 111,746 more rows
We can leverage some special functions especifically design to work withwith strings to enhance column extraction:
starts_with(x): names that start with xends_with(x): names that end with xcontains(x): selects all variables whose name contains xmatches(x): selects all variables whose name contains the regular expression xnum_range("x", 1:5, width = 2): selects all variables (numerically) from x01 to x05one_of ("x", "y", "z"): selects variables provided in a character vectoreverything(): selects all variables
Select columns that start with Diam:
select(trees, starts_with('Diam'))## # A tibble: 111,756 x 2
## DiamIf3 DiamIf2
## <dbl> <dbl>
## 1 20.3 18.9
## 2 34 32.4
## 3 24.8 17.6
## 4 16.8 12.6
## 5 34.0 30.9
## 6 33.1 28.2
## 7 15 13.2
## 8 17.5 15.3
## 9 67.4 66.8
## 10 15.1 12.6
## # ... with 111,746 more rows
Exercise 2
Think of three or four ways to select the variables that define the start and finish date of plot measuring.
Sorting rows (arrange)
arrange(trees, Dist)## # A tibble: 111,756 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 170672 17 021 73 0 127. 15 17.2 15.2 13.5
## 2 251882 25 025 174 0.3 127. 15 13.5 10.4 6
## 3 170640 17 045 290 0.3 127. 10 11.4 11.2 6.5
## 4 081544 08 025 176 0.3 127. 20 17.6 14.4 7
## 5 081080 08 025 107 0.4 127. 15 13.8 10.2 8
## 6 430170 43 024 0 0.4 127. 25 23.4 15.0 6.5
## 7 430631 43 021 17 0.4 127. 15 15.6 14 10.7
## 8 171941 17 079 150 0.4 127. 20 20.0 10.6 20
## 9 171990 17 045 34 0.4 127. 25 23.2 19.6 8
## 10 172056 17 045 160 0.4 127. 15 12.7 11.3 8
## # ... with 111,746 more rows
arrange(trees, desc(Dist))## # A tibble: 111,756 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 250008 25 042 398 36.5 5.09 70 74.8 72.3 11.1
## 2 250025 25 031 350 34.8 5.09 70 75.1 71.3 35.7
## 3 250135 25 022 158 33.9 5.09 50 51.8 54.0 9
## 4 250051 25 021 110 33.6 5.09 60 58.2 56.8 6.5
## 5 250025 25 031 238 33.5 5.09 50 51 45.5 20.3
## 6 250010 25 071 214 32.9 5.09 70 85.3 71.6 21
## 7 251788 25 045 163 32.8 5.09 60 61.6 51.7 10.5
## 8 250039 25 031 119 32.5 5.09 65 63.5 62.6 30
## 9 250025 25 031 390 32.1 5.09 45 45.6 43.8 24.6
## 10 171289 17 071 38 31.4 5.09 60 60.8 50.4 26.9
## # ... with 111,746 more rows
Exercise 3
3.1 Sort plots by date and hour of measurement
3.2 Which plots were started to be measured later in the day?
3.3 Which plots took longer to be measured?
Transforming variables (mutate)
mutate() allows to create/calculate new columns/variables in our dataset.
mutate(
trees,
Dist = Dist * 100
)## # A tibble: 111,756 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 08 022 7 830. 31.8 20 20.3 18.9 9
## 2 080002 08 476 38 910 31.8 35 34 32.4 9
## 3 080003 08 021 25 700 31.8 25 24.8 17.6 11
## 4 080004 08 021 28 889 31.8 15 16.8 12.6 9.5
## 5 080006 08 021 19 1119 14.1 35 34.0 30.9 13
## 6 080007 08 021 32 1200 14.1 35 33.1 28.2 10
## 7 080008 08 243 40 780 31.8 15 15 13.2 6
## 8 080009 08 045 16 509 31.8 20 17.5 15.3 7
## 9 080010 08 243 47 2689 5.09 65 67.4 66.8 16.5
## 10 080013 08 022 44 270 127. 15 15.1 12.6 9.5
## # ... with 111,746 more rows
mutate(
trees,
Alometria = DiamIf3 / HeiIf3,
Alometria2 = Alometria * DiamIf2
)## # A tibble: 111,756 x 12
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 Alometria
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0800~ 08 022 7 8.3 31.8 20 20.3 18.9 9 2.26
## 2 0800~ 08 476 38 9.1 31.8 35 34 32.4 9 3.78
## 3 0800~ 08 021 25 7 31.8 25 24.8 17.6 11 2.25
## 4 0800~ 08 021 28 8.89 31.8 15 16.8 12.6 9.5 1.77
## 5 0800~ 08 021 19 11.2 14.1 35 34.0 30.9 13 2.62
## 6 0800~ 08 021 32 12 14.1 35 33.1 28.2 10 3.31
## 7 0800~ 08 243 40 7.8 31.8 15 15 13.2 6 2.5
## 8 0800~ 08 045 16 5.09 31.8 20 17.5 15.3 7 2.5
## 9 0800~ 08 243 47 26.9 5.09 65 67.4 66.8 16.5 4.08
## 10 0800~ 08 022 44 2.7 127. 15 15.1 12.6 9.5 1.59
## # ... with 111,746 more rows, and 1 more variable: Alometria2 <dbl>
Special functions:
if_elseallows to implement a conditional structure in which we determine the output value when that condition was met and the value when it happens otherwise:
mutate(
trees,
Especie = if_else(Especie == '021', 'Pinus sylvestris', 'Other')
)## # A tibble: 111,756 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 08 Other 7 8.3 31.8 20 20.3 18.9 9
## 2 080002 08 Other 38 9.1 31.8 35 34 32.4 9
## 3 080003 08 Pinus sylvestris 25 7 31.8 25 24.8 17.6 11
## 4 080004 08 Pinus sylvestris 28 8.89 31.8 15 16.8 12.6 9.5
## 5 080006 08 Pinus sylvestris 19 11.2 14.1 35 34.0 30.9 13
## 6 080007 08 Pinus sylvestris 32 12 14.1 35 33.1 28.2 10
## 7 080008 08 Other 40 7.8 31.8 15 15 13.2 6
## 8 080009 08 Other 16 5.09 31.8 20 17.5 15.3 7
## 9 080010 08 Other 47 26.9 5.09 65 67.4 66.8 16.5
## 10 080013 08 Other 44 2.7 127. 15 15.1 12.6 9.5
## # ... with 111,746 more rows
Exercise 4
4.1 Get diameter growth (in cm) of each tree between IFN2 (DiamIf2) and IFN3 (DiamIf3)
4.2 Create two new variables with basal area of each tree (in $m^2$ per hectare), both for IFN2 and IFN3. Which is the species of the fastest-growing tree in basal area?
\[ AB = \frac{\pi}{4} · Diam^{2} · N \]
Reducing variables (summarise)
summarise(trees, mean_if3 = mean(DiamIf3))## # A tibble: 1 x 1
## mean_if3
## <dbl>
## 1 23.4
Summary functions
min(x),max(x),quantile(x, p)mean(x),median(x),sd(x),var(x),IQR(x)n(),n_distinct(x)sum(x > 10),mean(x > 10)
Grouped summarise
One of the most common ways in which we create summaries is by grouping on the basis of certain information. That means, we split observations between a number of groups and the calculate the summary statistics.
by_province <- group_by(trees, Provincia)
by_province## # A tibble: 111,756 x 10
## # Groups: Provincia [4]
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 08 022 7 8.3 31.8 20 20.3 18.9 9
## 2 080002 08 476 38 9.1 31.8 35 34 32.4 9
## 3 080003 08 021 25 7 31.8 25 24.8 17.6 11
## 4 080004 08 021 28 8.89 31.8 15 16.8 12.6 9.5
## 5 080006 08 021 19 11.2 14.1 35 34.0 30.9 13
## 6 080007 08 021 32 12 14.1 35 33.1 28.2 10
## 7 080008 08 243 40 7.8 31.8 15 15 13.2 6
## 8 080009 08 045 16 5.09 31.8 20 17.5 15.3 7
## 9 080010 08 243 47 26.9 5.09 65 67.4 66.8 16.5
## 10 080013 08 022 44 2.7 127. 15 15.1 12.6 9.5
## # ... with 111,746 more rows
summarise(
by_province,
mean_height_ifn3 = mean(HeiIf3, na.rm = TRUE),
max_height_ifn3 = max(HeiIf3, na.rm = TRUE),
min_height_ifn3 = min(HeiIf3, na.rm = TRUE)
)## # A tibble: 4 x 4
## Provincia mean_height_ifn3 max_height_ifn3 min_height_ifn3
## <chr> <dbl> <dbl> <dbl>
## 1 08 11.2 35 1.6
## 2 17 11.4 38 1.5
## 3 25 11.4 35.8 2
## 4 43 9.21 33 2
summarise(
by_province,
R = cor(HeiIf3, DiamIf3)
)## # A tibble: 4 x 2
## Provincia R
## <chr> <dbl>
## 1 08 0.621
## 2 17 0.600
## 3 25 0.646
## 4 43 0.700
Exercise 5
5.1 Which species showed the largest increase in diameter between IFN2 and IFN3?
5.2 Which province gather the largest number of species?
summarise_if and summarise_at
We can apply a summarising function to a group of variables that share some commmon characteristics that can be tested (i.e. numeric variables)
summarise_if(trees, is.numeric, mean)## # A tibble: 1 x 7
## Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 198. 7.78 55.6 23.4 23.4 20.3 11.1
We can limit the summary by name or using the select helpers (starts_with, ends_with, one_of)
summarise_at(trees, vars(starts_with('Diam')), mean)## # A tibble: 1 x 2
## DiamIf3 DiamIf2
## <dbl> <dbl>
## 1 23.4 20.3
Targeted transformations
The same can be done with mutate:
mutate_if(trees, is.numeric, log)## # A tibble: 111,756 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 08 022 1.95 2.12 3.46 3.00 3.01 2.94 2.20
## 2 080002 08 476 3.64 2.21 3.46 3.56 3.53 3.48 2.20
## 3 080003 08 021 3.22 1.95 3.46 3.22 3.21 2.87 2.40
## 4 080004 08 021 3.33 2.18 3.46 2.71 2.82 2.54 2.25
## 5 080006 08 021 2.94 2.42 2.65 3.56 3.53 3.43 2.56
## 6 080007 08 021 3.47 2.48 2.65 3.56 3.50 3.34 2.30
## 7 080008 08 243 3.69 2.05 3.46 2.71 2.71 2.58 1.79
## 8 080009 08 045 2.77 1.63 3.46 3.00 2.86 2.73 1.95
## 9 080010 08 243 3.85 3.29 1.63 4.17 4.21 4.20 2.80
## 10 080013 08 022 3.78 0.993 4.85 2.71 2.71 2.53 2.25
## # ... with 111,746 more rows
mutate_at(
trees, vars(one_of(c('Especie', 'Species'))),
~ paste0('sp_', .x)
)## Warning: Unknown columns: `Species`
## # A tibble: 111,756 x 10
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 08 sp_022 7 8.3 31.8 20 20.3 18.9 9
## 2 080002 08 sp_476 38 9.1 31.8 35 34 32.4 9
## 3 080003 08 sp_021 25 7 31.8 25 24.8 17.6 11
## 4 080004 08 sp_021 28 8.89 31.8 15 16.8 12.6 9.5
## 5 080006 08 sp_021 19 11.2 14.1 35 34.0 30.9 13
## 6 080007 08 sp_021 32 12 14.1 35 33.1 28.2 10
## 7 080008 08 sp_243 40 7.8 31.8 15 15 13.2 6
## 8 080009 08 sp_045 16 5.09 31.8 20 17.5 15.3 7
## 9 080010 08 sp_243 47 26.9 5.09 65 67.4 66.8 16.5
## 10 080013 08 sp_022 44 2.7 127. 15 15.1 12.6 9.5
## # ... with 111,746 more rows
Pipes and data pipelines (%>%)
Often, we will need to use several verbs (filter, arrange, group_by, summarise…) to manage our data towards a final result. However, multiple operations are difficult to read, or require to create multiple intermediate objects to achieve our goal:
In programming languages this is usually solved nesting functions one inside another, so that we applie and ordered sequence of instructions to our data:
diam_especie <- filter(
summarise(
group_by(
filter(
trees, !is.na(DiamIf3)
),
Codi, Especie
),
diam = mean(DiamIf3),
n = n()
),
n > 5
)no_na_trees <- filter(
trees, !is.na(DiamIf3)
)
no_na_trees_grouped <- group_by(
no_na_trees, Codi, Especie
)
summarised_no_na_trees <- summarise(
no_na_trees_grouped,
diam = mean(DiamIf3), n = n()
)
final_data <- filter(
summarised_no_na_trees, n > 5
)The tidy approach solves this issue, producing a way better coding experience (cleaner and easy to read) by means of its alternative pipe operator (%>%) from magrittr package
The name pipe refers to the ability of connecting the input on the left side passing it to the function in the right as its first argument:
f(x, y) is the same as x %>% f(y)
f(x, y, z) is the same as x %>% f(y, z)
Summing up, in the tidyverse %>% makes each function to be applied to the data frame resulting from the previous step.
filter(df, color == 'blue') is the same as df %>% filter(color == 'blue')
mutate(df, double = 2*value) is the same as df %>% mutate(double = 2*value)
Untidy
diam_especie <- filter(
summarise(
group_by(
filter(
trees, !is.na(DiamIf3)
),
Codi, Especie
),
diam = mean(DiamIf3),
n = n()
),
n > 5
)Tidy
diam_especie <- trees %>%
filter(!is.na(DiamIf3)) %>%
group_by(Codi, Especie) %>%
summarise(
diam = mean(DiamIf3),
n = n()
) %>%
filter(n > 5)Exercise 6
Create pipelines to answer the following questions:
6.1 Which **plots** have the fastest average growth rate?
6.2 Which is the plot with the **most species**?
6.3 Is there any **relationship** between both variables? <br>
*(Optional, some knowledge on `ggplot`is required)*
Grouped mutate/filter
We will commonly use groups (group_by) when summarising variables (n inputs, one output):
group_by(Especie) %>% summarise(mean = mean(Diam))Sometimes, however, we may be interested in calculating new variables by group, but without reducing the dimensions:
trees %>%
group_by(Especie) %>%
mutate(
std_diam = DiamIf3 - mean(DiamIf3)
)## # A tibble: 111,756 x 11
## # Groups: Especie [91]
## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 std_diam
## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 08 022 7 8.3 31.8 20 20.3 18.9 9 -6.71
## 2 080002 08 476 38 9.1 31.8 35 34 32.4 9 13.5
## 3 080003 08 021 25 7 31.8 25 24.8 17.6 11 -0.555
## 4 080004 08 021 28 8.89 31.8 15 16.8 12.6 9.5 -8.50
## 5 080006 08 021 19 11.2 14.1 35 34.0 30.9 13 8.70
## 6 080007 08 021 32 12 14.1 35 33.1 28.2 10 7.75
## 7 080008 08 243 40 7.8 31.8 15 15 13.2 6 -5.86
## 8 080009 08 045 16 5.09 31.8 20 17.5 15.3 7 1.40
## 9 080010 08 243 47 26.9 5.09 65 67.4 66.8 16.5 46.5
## 10 080013 08 022 44 2.7 127. 15 15.1 12.6 9.5 -11.9
## # ... with 111,746 more rows
Exercise 7
7.1 Identify those trees that grew most as compared to the average in that plot <br>
.font80[(Hint: calculate growth, *then* mean growth by plot, and *then* the difference)]
7.2 Identify those plots where a species grows much more than the average for the species
Extra (in case you get bored):
7.3 Select IFN plots with pure Pinus nigra stands (Especie = 025). Note: we consider a forest to be monospecific when > 80% in BA corresponds to a single species
Joining information from two tables
Sometimes, especially when information comes from database structures, we will find our information split in two or more tables that share a common variable (primary/external key).
dplyroffers several choices to merge information, depending on our specific needs.
left_join(x, y): Add observations in y that also appears in x. Original observations (x) are not lost:
right_join(x, y)`: Add observations in x that also appears in y. Original observations (y) are not lost:
full_join(x, y): All observations, x and y:
inner_join(x, y): Only observations present in both x and y:
Exercise 8
Add X and Y coordinates that are included in the `coordinates` data frame to each plot in
the `plots` data frame
tidyr
Data in tidy format eases the processing and analysis, particularly in vectorized languages as R.
But data is not always properly organized and may incurr in several issues that harm its analysis. In the next example, the name of the colums are values rather than actual variables.
## # A tibble: 5,769 x 22
## iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04 f514 f014 f1524
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 AD 1989 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 2 AD 1990 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 3 AD 1991 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 4 AD 1992 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 5 AD 1993 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 6 AD 1994 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 7 AD 1996 NA NA 0 0 0 4 1 0 0 NA NA NA 0 1
## 8 AD 1997 NA NA 0 0 1 2 2 1 6 NA NA NA 0 1
## 9 AD 1998 NA NA 0 0 0 1 0 0 0 NA NA NA NA NA
## 10 AD 1999 NA NA 0 0 0 1 1 0 0 NA NA NA 0 0
## # ... with 5,759 more rows, and 6 more variables: f2534 <int>, f3544 <int>, f4554 <int>,
## # f5564 <int>, f65 <int>, fu <int>
The package tidyr brings in a number of functions to organize our data. This often involves converting table-like formats between wide to long formats.
Wide versus long data
The tidyverse and tidy data
Data in tidy format eases the processing and analysis, particularly in vectorized languages as R. Data has to meet some requirement in order to be considered tidy:
- Each row represents an observation
- Columns comprise variables
- Echa combination of row and column gathers a value
There are 5 examples of messy data we will explore here:
Column headers are values, not variable names.- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
Illustrating the difference between wide and long datasets is easiest using an example:
country_long <- data.frame(
expand.grid(country = c("Sweden", "Denmark", "Norway"), year = 1994:1996),
avgtemp = round(runif(9, 3, 12), 0)
)
country_long## country year avgtemp
## 1 Sweden 1994 11
## 2 Denmark 1994 4
## 3 Norway 1994 6
## 4 Sweden 1995 12
## 5 Denmark 1995 11
## 6 Norway 1995 3
## 7 Sweden 1996 4
## 8 Denmark 1996 5
## 9 Norway 1996 11
country_wide <- data.frame(
country = c("Sweden", "Denmark", "Norway"),
avgtemp.1994 = country_long$avgtemp[1:3],
avgtemp.1995 = country_long$avgtemp[4:6],
avgtemp.1996 = country_long$avgtemp[7:9])
country_wide ## country avgtemp.1994 avgtemp.1995 avgtemp.1996
## 1 Sweden 11 12 4
## 2 Denmark 4 11 5
## 3 Norway 6 3 11
The long dataset separates the unit of analysis (country-year) into two separate variables while the wide dataset combines one of the keys (year) with the value variable (avgtemp).
A case for long data
There are many reasons to prefer datasets structured in long form. Repeating some of the points made in Wickham (2014), here are three reasons why you should attempt to structure your data in long form:
If you have many value variables, it is difficult to summarize wide-form datasets at a glance (which in turn makes it hard to identify mistakes in the data). For example, imagine we have a dataset with 50 years and 10 value variables of interest - this would result in 500 columns in wide form. Summarizing each column to look for strange observations, or simply understanding which variables are included in the dataset, becomes difficult in this case.
Structuring data as key-value pairs - as is done in long-form datasets - facilitates conceptual clarity. For example, in country_long above, it is clear that the unit of analysis is country-year - or, put differently, that the variables country and year jointly constitute the key in the dataset. In wide-form datasets, one of the variables that constitutes the unit of analysis is mixed with a variable that holds values. (Read more about this in Hadley’s paper referenced above.)
Long-form datasets are often required for advanced statistical analysis and graphing. For example, if you wanted to run a regression with year and/or country fixed effects, you would have to structure your data in long form. Furthermore, many graphing packages, including
ggplot2, rely on your data being in long form.
The function gather enables the conversion from wide to long.
## # A tibble: 35,750 x 5
## iso2 year sex age_group n
## <chr> <int> <chr> <chr> <int>
## 1 AD 1996 f 014 0
## 2 AD 1996 f 1524 1
## 3 AD 1996 f 2534 1
## 4 AD 1996 f 3544 0
## 5 AD 1996 f 4554 0
## 6 AD 1996 f 5564 1
## 7 AD 1996 f 65 0
## 8 AD 1996 m 014 0
## 9 AD 1996 m 1524 0
## 10 AD 1996 m 2534 0
## # ... with 35,740 more rows
species## # A tibble: 14,778 x 15
## # Groups: Codi, Especie [14,778]
## Codi Especie CD_10 CD_15 CD_20 CD_25 CD_30 CD_35 CD_40 CD_45 CD_50 CD_55 CD_60 CD_65 CD_70
## * <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 022 0 159. 31.8 111. 60.1 19.2 5.09 0 0 0 0 0 0
## 2 080002 021 0 0 0 0 0 74.2 28.3 63.7 0 0 0 0 0
## 3 080002 022 0 0 0 173. 31.8 0 0 0 0 0 0 0 0
## 4 080002 476 0 0 0 0 0 31.8 0 0 0 0 0 0 0
## 5 080003 021 0 0 0 31.8 0 0 0 0 5.09 0 0 0 0
## 6 080003 022 0 127. 0 0 46.0 127. 0 14.1 0 0 0 0 0
## 7 080004 021 0 31.8 0 0 31.8 0 0 0 0 0 0 0 0
## 8 080005 071 0 0 0 14.1 5.09 31.8 0 0 0 0 31.8 0 0
## 9 080005 243 0 0 14.1 0 14.1 0 5.09 0 0 0 0 0 0
## 10 080005 278 0 286. 31.8 0 0 31.8 0 0 0 0 0 0 0
## # ... with 14,768 more rows
The verbs of tidyr
gather: Convert data from wide to long format (columns to id-value pairs)spread: Convert data from long to wide format (id-value pairs to columns)separate: Convert one column in serveralunite: Join several columns in one
n_parcelas <- tibble(
Prov = c('Lleida', 'Girona', 'Barcelona', 'Tarragona'),
IFN_2 = c(16, 78, 60, 34),
IFN_3 = c(18, 79, 67, 36)
)
n_parcelas## # A tibble: 4 x 3
## Prov IFN_2 IFN_3
## <chr> <dbl> <dbl>
## 1 Lleida 16 18
## 2 Girona 78 79
## 3 Barcelona 60 67
## 4 Tarragona 34 36
gather(df, key, value, vars)
n_parcelas_tidy <- gather(n_parcelas,IFN, n, IFN_2, IFN_3)n_parcelas## # A tibble: 4 x 3
## Prov IFN_2 IFN_3
## <chr> <dbl> <dbl>
## 1 Lleida 16 18
## 2 Girona 78 79
## 3 Barcelona 60 67
## 4 Tarragona 34 36
n_parcelas_tidy## # A tibble: 8 x 3
## Prov IFN n
## <chr> <chr> <dbl>
## 1 Lleida IFN_2 16
## 2 Girona IFN_2 78
## 3 Barcelona IFN_2 60
## 4 Tarragona IFN_2 34
## 5 Lleida IFN_3 18
## 6 Girona IFN_3 79
## 7 Barcelona IFN_3 67
## 8 Tarragona IFN_3 36
separate(df, col, into, sep)
n_parcelas_sep <- separate(n_parcelas_tidy, IFN, c('source', 'version'), sep = '_')
n_parcelas_sep## # A tibble: 8 x 4
## Prov source version n
## <chr> <chr> <chr> <dbl>
## 1 Lleida IFN 2 16
## 2 Girona IFN 2 78
## 3 Barcelona IFN 2 60
## 4 Tarragona IFN 2 34
## 5 Lleida IFN 3 18
## 6 Girona IFN 3 79
## 7 Barcelona IFN 3 67
## 8 Tarragona IFN 3 36
unite(data, col, vars, sep)
n_parcelas_unite <- unite(n_parcelas_sep, IFN, source, version, sep = '_')
n_parcelas_unite## # A tibble: 8 x 3
## Prov IFN n
## <chr> <chr> <dbl>
## 1 Lleida IFN_2 16
## 2 Girona IFN_2 78
## 3 Barcelona IFN_2 60
## 4 Tarragona IFN_2 34
## 5 Lleida IFN_3 18
## 6 Girona IFN_3 79
## 7 Barcelona IFN_3 67
## 8 Tarragona IFN_3 36
Exercise 9
Use `gather` and `separate` to transform the data frame *species* into a **tidy** format, where each column is a variable and each row an observation.
## # A tibble: 14,778 x 15
## # Groups: Codi, Especie [14,778]
## Codi Especie CD_10 CD_15 CD_20 CD_25 CD_30 CD_35 CD_40 CD_45 CD_50 CD_55 CD_60 CD_65 CD_70
## * <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 080001 022 0 159. 31.8 111. 60.1 19.2 5.09 0 0 0 0 0 0
## 2 080002 021 0 0 0 0 0 74.2 28.3 63.7 0 0 0 0 0
## 3 080002 022 0 0 0 173. 31.8 0 0 0 0 0 0 0 0
## 4 080002 476 0 0 0 0 0 31.8 0 0 0 0 0 0 0
## 5 080003 021 0 0 0 31.8 0 0 0 0 5.09 0 0 0 0
## 6 080003 022 0 127. 0 0 46.0 127. 0 14.1 0 0 0 0 0
## 7 080004 021 0 31.8 0 0 31.8 0 0 0 0 0 0 0 0
## 8 080005 071 0 0 0 14.1 5.09 31.8 0 0 0 0 31.8 0 0
## 9 080005 243 0 0 14.1 0 14.1 0 5.09 0 0 0 0 0 0
## 10 080005 278 0 286. 31.8 0 0 31.8 0 0 0 0 0 0 0
## # ... with 14,768 more rows
spread(df, key, value, sep)
n_parcelas2 <- spread(n_parcelas_unite, IFN, n)n_parcelas## # A tibble: 4 x 3
## Prov IFN_2 IFN_3
## <chr> <dbl> <dbl>
## 1 Lleida 16 18
## 2 Girona 78 79
## 3 Barcelona 60 67
## 4 Tarragona 34 36
]
n_parcelas2## # A tibble: 4 x 3
## Prov IFN_2 IFN_3
## <chr> <dbl> <dbl>
## 1 Barcelona 60 67
## 2 Girona 78 79
## 3 Lleida 16 18
## 4 Tarragona 34 36
Exercise 10
Use `unite` and `spread` to transform the data from exercise 9 into its original format
purrr
purrr allows to do functional programming with R, making loops and iterative tasks pipe-friendly and, thus, easier to read.
c(1, 2, 3) %>%
map_dbl(~ .*2)## [1] 2 4 6
c('a', 'b', 'c') %>%
map_chr(~ paste0('treatment_', .))## [1] "treatment_a" "treatment_b" "treatment_c"
Can be used as an alternative to apply family functions (lapply, vapply…) which are sometimes unclean and difficult to use.
purrr’s common structure
verb(.x, .f, ...)- first argument (.x): vector or list (included data frames)
- second argument (.f): function to apply to each element of .x
- output is always a vector or a list of the same length as .x
Main functions in purrr:
map: Tranforms the input by applying a function to each element and returning a vector the same length as the input.map: Returns always a listmap_chr,map_dbl,map_int,map_lgl: Returns a vector of the corresponding typemap_dfr,map_dfc: Returns a data frame created by row- or column-binding
Models workflow become easier
trees %>%
split(.$Provincia) %>%
map(~ lm(HeiIf3 ~ DiamIf3, data = .)) %>%
map_dfr(~ broom::tidy(.))## # A tibble: 8 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.04 0.0464 109. 0
## 2 DiamIf3 0.275 0.00192 144. 0
## 3 (Intercept) 4.74 0.0526 90.2 0
## 4 DiamIf3 0.290 0.00211 137. 0
## 5 (Intercept) 5.56 0.0404 137. 0
## 6 DiamIf3 0.234 0.00147 160. 0
## 7 (Intercept) 3.75 0.0614 61.2 0
## 8 DiamIf3 0.238 0.00247 96.4 0
References
Ameztegui, A., and V. Granda. 2019. “Introduction to the Tidyverse.” GitHub Repository. https://github.com/ameztegui/SIBECOL_workshop; GitHub.
Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.